Using Large Multi-purpose Corpora for Specific Research Questions: Discourse Phenomena Related to Wh-questions in the Spoken Dutch Corpus

نویسندگان

  • Nelleke Oostdijk
  • Lou Boves
چکیده

In this paper, we investigate whether a dataset derived from a multi-purpose corpus such as the Spoken Dutch Corpus may be considered appropriate for developing a taxonomy of wh-questions, and a model of the way in which these questions are integrated in spoken discourse. We compare the results obtained from the Spoken Dutch Corpus with a similar analysis of a large random collection of FAQs from the internet. We find substantial differences between the questions in spoken discourse and FAQs. Therefore, it may not be trivial to use a general purpose corpus as a starting point for developing models for human-computer interaction.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Interface between information structure and intonation in Dutch WH-questions

This study set out to investigate how accent placement is pragmatically governed in WH-questions. Central to this issue are questions such as whether the intonation of the WH-word depends on the information structure of the non-WH word part, whether topical constituents can be accented, and whether constituents in the non-WH word part can be nontopical and accented. Previous approaches, based e...

متن کامل

Harvesting Dutch Trees: Syntactic Properties of Spoken Dutch

In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotations, we illustrate how the Corpus may be utilized...

متن کامل

Investigating speech style specific pronunciation variation in large spoken language corpora

In the past, linguistic research was typically conducted on relatively small datasets that were specifically designed for the research at hand. Whereas to date many large spoken language corpora have become available, the usefulness of these corpora is still not fully established in linguistic research. The research reported on in this paper was conducted to illustrate the potential of large mu...

متن کامل

On the Usefulness of Large Spoken Language Corpora for Linguistic Research

In the past, fundamental linguistic research was typically conducted on small data sets that were handcrafted for the specific research at hand. However, from the eighties onwards, many large spoken language corpora have become available. This study investigates the usefulness of large multi-purpose spoken language corpora for fundamental linguistic research. A research task was designed in whi...

متن کامل

Automatic Acquisition of Lexico-semantic Knowledge for QA

We present an experiment for finding semantically similar words on the basis of a parsed corpus of Dutch text and show that the acquired information correlates with relations found in Dutch EuroWordNet. Next, we demonstrate how the acquired knowledge can be used to boost the performance of an open-domain question answering system for Dutch. Automatically acquired lexico-semantic information is ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004